This practical exercise builds upon existing ggplot2 resources.
The first part of the notebook will walk you through the different elements that make up a plot and highlight tips that will help with practical code implementation of ggplot2. It is a condensed version of an online workshop delivered by a member of the Tidyverse team, Thomas Lin Pedersen, which can be viewed on YouTube.
The second part of the notebook provides a practical framework to develop a consistent design style for yourself or your organisation.
ggplot2 is an implementation of the theoretical framework for constructing data visualisations known as the Grammar of Graphics.
Rather than thinking of data visualisations as entirely separate entities, e.g., box plot, line chart, bar chart, scatter plot, the Grammar of Graphics breaks all data visualisations into eight constituent parts, which are layered on top of each other. The theory explains the relationships between these eight parts: data, mapping, geometries, statistics, scales, facets, coordinates and theme.
This approach means you are not limited to a set of predefined data visualisations. You can build plots from the bottom up, and tailor them however you prefer.
In the sections below we will use the Grammar of Graphics framework to explore how to create plots with ggplot2.
This document comes with a list of required libraries: ‘ggplot2’ and ‘dplyr’. We will mainly use two of the built-in data sets, ‘diamonds’ and ‘iris’; the scales exercise and the facet examples also use ‘mpg’.
- ggplot() initialises a ggplot object
- Use + to combine ggplot2 elements
# Data 1
# Data can be added to ggplot(). If set here the data and mapping values will be inherited by subsequent layers.
ggplot(data = iris,
mapping = aes(x = Sepal.Length, y = Petal.Length)) +
geom_point()
# Data 2
# Another method is to set data in the ggplot() element and the mapping in the geom layer.
ggplot(data = iris) +
geom_point(mapping = aes(Sepal.Length, Petal.Length))
# Data 3
# You can also set both the data and the mapping at the geom layer
ggplot() +
geom_point(data = iris, mapping = aes(x = Sepal.Length, y = Petal.Length))
# Data 4
# It is common to omit the data and mapping argument names and pass them by position
ggplot(iris) +
geom_point(aes(x = Sepal.Length, y = Petal.Length))
Note: to create the plot in the example above you only needed to:
- Specify which data to use
- Select the aesthetic mappings
- Select a geometry
This is because there are a lot of sensible defaults written into the ggplot2 package that handle the other elements of the plot.
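To make those defaults visible, the sketch below writes out the elements that ggplot2 would otherwise fill in for the scatter plot above. It assumes the usual defaults for this plot (identity stat and position, continuous x and y scales, Cartesian coordinates, no faceting and theme_grey()), so treat it as an illustration rather than a description of ggplot2 internals. The names p_short and p_explicit are only used here.

```r
library(ggplot2)

# The short version: data, mapping and geometry only
p_short <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Petal.Length))

# The same plot with the (assumed) defaults written out explicitly
p_explicit <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Petal.Length),
             stat = "identity", position = "identity") +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian() +
  facet_null() +
  theme_grey()

p_explicit
```

Both objects should draw the same picture; the second is simply more verbose.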
Each geometry has its own mapping requirements (e.g., geom_point() requires x and y values and geom_histogram() only requires x values). If you are unsure of a geom’s mapping requirements you can check the ‘Aesthetics’ section of the help documentation (e.g., ?geom_point).
# Mapping 1
# You can map colour to your data using the colour aesthetic.
# aes() takes unquoted expressions rather than strings, so colour can be mapped to an expression such as Petal.Length > 2. Note that a legend is created when you map a variable to colour.
ggplot(data = iris) +
geom_point(mapping = aes(x = Sepal.Length,
y = Petal.Length,
colour = Petal.Length >2))
# Mapping 2
# It is important to remember the difference between mapping a colour and setting a colour
# This is how you set the colour on a plot
ggplot(data = iris) +
geom_point(mapping = aes(x = Sepal.Length,
y = Petal.Length),
colour = "purple")
# Mapping 3
# Some geometries use fill to set the colour inside the graphic
ggplot(data = iris) +
geom_histogram(mapping = aes(x = Sepal.Length),
fill = "purple",
colour = "navy")
# Mapping 4
# This is how **NOT** to set the colour of a plot
# Do you know what is happening here?
ggplot(data = iris) +
geom_point(mapping = aes(x = Sepal.Length,
y = Petal.Length,
colour = "purple"))
Remember: if you want to map a colour from the data, put it inside aes(); if you want to set a colour or size, put it outside aes().
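The difference between mapping and setting can be checked by looking at the values a layer actually draws with. The sketch below uses layer_data() from ggplot2 for this; p_set and p_map are names chosen for illustration. When "purple" is placed inside aes() it is treated as a data value with a single level and passed through the default colour scale, which is why the points do not come out purple and a legend appears.

```r
library(ggplot2)

# Setting: colour is outside aes(), so every point is drawn in purple
p_set <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Petal.Length), colour = "purple")

# Mapping: colour is inside aes(), so "purple" is treated as a data value
# and the default colour scale decides which colour represents it
p_map <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Petal.Length, colour = "purple"))

# layer_data() returns the computed values for a layer
unique(layer_data(p_set, 1)$colour)  # the literal colour that was set
unique(layer_data(p_map, 1)$colour)  # a colour picked by the scale
```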
Update the code to make the points larger purple triangles and slightly transparent.
ggplot(iris) +
geom_point(aes(x = Sepal.Length, y = Petal.Length))
Colour the two distributions in the histogram with different colours.
ggplot(iris) +
geom_histogram(aes(x = Petal.Length))
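One possible answer to each of the two exercises above is sketched here; other answers are equally valid. The size, alpha and bins values are arbitrary choices, and the cut-off of 2.5 used to split the histogram is an assumption picked by eye from the gap between the two distributions.

```r
library(ggplot2)

# Exercise 1: shape, colour, size and alpha are set (outside aes())
# shape 17 is a filled triangle
ex1 <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Petal.Length),
             shape = 17, colour = "purple", size = 3, alpha = 0.5)

# Exercise 2: fill is mapped (inside aes()) to a logical expression,
# so each side of the assumed cut-off gets its own fill colour
ex2 <- ggplot(iris) +
  geom_histogram(aes(x = Petal.Length, fill = Petal.Length > 2.5), bins = 30)

ex1
ex2
```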
# Geometries 1
# In this example you can see that the points sit on top of the line
ggplot(data = iris) +
geom_line(mapping = aes(x = Sepal.Length, y = Petal.Length), colour = "pink") +
geom_point(mapping = aes(x = Sepal.Length, y = Petal.Length))
# In this example the line sits on top of the points
# This is due to the order of the geometries
ggplot(data = iris) +
geom_point(mapping = aes(x = Sepal.Length, y = Petal.Length)) +
geom_line(mapping = aes(x = Sepal.Length, y = Petal.Length), colour = "pink")
The statistics part refers to the calculations that transform your raw data into the values we want to display on the plot. For example, a bar chart calculates counts and box plots calculate a range of distribution values.
- Every geometry has a default statistic.
- Every statistic has a default geometry.
- Both can be used, but people are used to thinking in geometries. You visualise bars rather than counts.
- You can access the values generated through the stat by using the after_stat() function.
#geom_bar() uses stat_count() by default
ggplot(data = diamonds) +
geom_bar(mapping = aes(x=cut)) +
ggtitle("geom_bar()")
# we get the same plot because the default geom for stat_count() is geom_bar()
ggplot(data = diamonds) +
stat_count(mapping = aes(x=cut)) +
ggtitle("stat_count()")
# but you can update the geom to something else - normally you wouldn't want to override this.
ggplot(data = diamonds) +
stat_count(mapping = aes(x=cut), geom = 'point') +
ggtitle("stat_count()")
# You can calculate the statistics yourself and plot the results
diamonds_counted_by_cut <- diamonds %>%
group_by(cut) %>%
summarise(count = n())
ggplot(data = diamonds_counted_by_cut) +
geom_bar(mapping = aes(x=cut, y=count),
stat = 'identity') + # identity stat doesn't compute anything
ggtitle("stat = 'identity'")
# ggplot2 provides geometries for the most frequently used combinations of statistics and geometries. For example, geom_col() is geom_bar() with the stat set to "identity".
ggplot(data = diamonds_counted_by_cut) +
geom_col(mapping = aes(x=cut, y=count)) +
ggtitle("geom_col()")
# You can access the values generated through the stat by using the after_stat() function.
# Here we will change the counts to percentages
#geom_bar() uses stat_count() by default
ggplot(data = diamonds) +
geom_bar(mapping =
aes(
x = cut,
y = after_stat(100 * count / sum(count))
)
) + labs(y = "Percentage")
# Computed variables are provided in the stat_ documentation
# ?stat_count
# ?stat_density
# ?stat_boxplot
Use stat_summary() to add a red dot at the mean depth for each group
ggplot(diamonds) +
geom_jitter(aes(x = cut, y = depth), width = 0.3)
Hint: You will need to change the default geom of stat_summary()
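A possible solution is sketched below. It assumes a version of ggplot2 in which stat_summary() accepts the fun argument (3.3.0 or later), and the point size is an arbitrary choice. The name p_mean is only used here.

```r
library(ggplot2)

p_mean <- ggplot(diamonds) +
  geom_jitter(aes(x = cut, y = depth), width = 0.3) +
  # fun = mean computes one value per cut; geom = "point" replaces the
  # default pointrange geom of stat_summary()
  stat_summary(aes(x = cut, y = depth),
               fun = mean, geom = "point", colour = "red", size = 3)

p_mean
```

Because the summary layer is added after geom_jitter(), the red dots are drawn on top of the jittered points.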
Scales map our input data to the graphical output. Scales define how the mapping you specify inside aes() should happen. All mappings have an associated scale even if not specified. For example, when you assign a variable to colour, ggplot2 determines what kind of values it holds (discrete, continuous or binned) and scales them accordingly.
# ggplot looks at the Species variable, sees it has discrete values and automatically assigns a colour to the three Species values. It also creates a legend so you can map back to the points. If we don't like these colours we need to add a scale to update the default colours.
ggplot(iris) +
geom_point(aes(x = Petal.Length, y = Petal.Width, colour = Species))
# If we don't like the colours ggplot assigns we need to add a scale to update the default colours
# The brewer scales provide sequential, diverging and qualitative colour schemes from ColorBrewer.
# Type: One of seq (sequential), div (diverging) or qual (qualitative)
ggplot(iris) +
geom_point(aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
scale_color_brewer(type = 'qual')
Remember: You can add colours manually, but be aware that colours are complex, so you need to consider whether your chosen colours represent the data. For example, qualitative schemes should not imply magnitude differences. You should also consider how they are perceived by those with colour deficiencies.
# NICD branded colours
ggplot(iris) +
geom_point(aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
scale_color_manual(values = c("#002D30", "#4FE18F", "#A361FF"))
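One way to act on the point about colour deficiencies is to start from a palette that was designed with them in mind. The sketch below uses scale_colour_viridis_d(), the discrete viridis scale that ships with ggplot2; whether it suits your data or your branding remains a judgement call, and p_viridis is just an illustrative name.

```r
library(ggplot2)

# Discrete viridis palette: one colour per Species level
p_viridis <- ggplot(iris) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  scale_colour_viridis_d()

p_viridis
```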
Positional mappings (x and y) also have associated scales. You can use scales to change break points and gridlines and to transform the data. Each scale has built-in transformations, but the default is normally ‘identity’, which leaves the data untransformed. Built-in transformations include “asn”, “atanh”, “boxcox”, “date”, “exp”, “hms”, “identity”, “log”, “log10”, “log1p”, “log2”, “logit”, “modulus”, “probability”, “probit”, “pseudo_log”, “reciprocal”, “reverse”, “sqrt” and “time”.
# scale the continuous x and y values
ggplot(iris) +
geom_point(aes(x = Petal.Length, y = Petal.Width)) +
scale_x_continuous(breaks = c(3, 5, 6)) +
scale_y_continuous(trans = 'log10')
Remember: if you get stuck, there is a lot of documentation available for ggplot2.
Modify the code below to create a bubble chart (a scatterplot with size mapped to a continuous variable) that maps cyl to size. Make sure that only the cylinder counts present in the data (4, 5, 6 and 8) appear in the legend.
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy, colour = class, size = cyl)) +
scale_colour_brewer(type = 'qual') +
#scale_size(breaks = c(4,5,6,8)) #scales area
#scale_radius(breaks = c(4,5,6,8))
scale_size_area(breaks = c(4,5,6,8))
# The eye perceives the area, not the radius.
Hint: The breaks argument in the scale is used to control which values are present in the legend.
You may wish to split a plot into multiple plotting areas (facets). Faceting controls the layout of these subplots, and the layout itself may carry meaning.
# facet_wrap() wraps a 1d sequence of panels into 2d.
ggplot(mpg) +
geom_point(aes(x=displ, y=hwy)) +
facet_wrap(~ class)
#facet_grid() forms a matrix of panels defined by row and column faceting variables. It is most useful when you have two discrete variables, and all combinations of the variables exist in the data. If you have only one variable with many levels, try facet_wrap().
ggplot(mpg) +
geom_point(aes(x=displ, y=hwy)) +
facet_grid(year ~ drv)
Other coordinate systems
- coord_cartesian: most plots are drawn in a cartesian coordinate system
- coord_polar: interprets x and y as angle and radius (by default x is mapped to the angle)
ggplot(data = diamonds) +
geom_bar(mapping = aes(x=cut)) +
coord_polar()
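Remember: use coord_cartesian() rather than scale limits to zoom in on a plot when you don’t want to affect your data. Scale limits discard values that fall outside the range, whereas coord_cartesian() only changes the visible area.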
ggplot(data = diamonds) +
geom_bar(mapping = aes(x=cut))# +
#scale_y_continuous(limits = c(0, 12500))
#coord_cartesian(ylim=c(0, 12500))
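The two commented-out lines above behave differently, which the sketch below tries to make concrete. It relies on layer_data() to inspect the bar heights after the plot is built; base_bars, p_scale and p_coord are illustrative names. The expectation, under the assumption that out-of-range values are censored to NA by the scale, is that the tallest bars lose their height in the scale version but keep it in the coord version.

```r
library(ggplot2)

base_bars <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Scale limits: counts above 12500 fall outside the scale, so those bars
# are dropped when the plot is drawn (ggplot2 warns about removed rows)
p_scale <- base_bars + scale_y_continuous(limits = c(0, 12500))

# Coordinate limits: the data are untouched and the view is zoomed,
# so the tallest bars are still drawn but run off the top of the panel
p_coord <- base_bars + coord_cartesian(ylim = c(0, 12500))

# Bar heights after building: NA marks values removed by the scale
layer_data(p_scale, 1)$y
layer_data(p_coord, 1)$y
```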
Why focus on styling? Decision making is the key competency in data visualisation. We need to make effective decisions efficiently. We can use a design process to facilitate effective decision making and we can adopt a standard theme, based upon good design principles, to streamline the final presentation work.
theme() arguments
# this code will not run
theme(
line,
rect,
text,
title,
aspect.ratio,
axis.title,
axis.title.x,
axis.title.x.top,
axis.title.x.bottom,
axis.title.y,
axis.title.y.left,
axis.title.y.right,
axis.text,
axis.text.x,
axis.text.x.top,
axis.text.x.bottom,
axis.text.y,
axis.text.y.left,
axis.text.y.right,
axis.ticks,
axis.ticks.x,
axis.ticks.x.top,
axis.ticks.x.bottom,
axis.ticks.y,
axis.ticks.y.left,
axis.ticks.y.right,
axis.ticks.length,
axis.ticks.length.x,
axis.ticks.length.x.top,
axis.ticks.length.x.bottom,
axis.ticks.length.y,
axis.ticks.length.y.left,
axis.ticks.length.y.right,
axis.line,
axis.line.x,
axis.line.x.top,
axis.line.x.bottom,
axis.line.y,
axis.line.y.left,
axis.line.y.right,
legend.background,
legend.margin,
legend.spacing,
legend.spacing.x,
legend.spacing.y,
legend.key,
legend.key.size,
legend.key.height,
legend.key.width,
legend.text,
legend.text.align,
legend.title,
legend.title.align,
legend.position,
legend.direction,
legend.justification,
legend.box,
legend.box.just,
legend.box.margin,
legend.box.background,
legend.box.spacing,
panel.background,
panel.border,
panel.spacing,
panel.spacing.x,
panel.spacing.y,
panel.grid,
panel.grid.major,
panel.grid.minor,
panel.grid.major.x,
panel.grid.major.y,
panel.grid.minor.x,
panel.grid.minor.y,
panel.ontop,
plot.background,
plot.title,
plot.title.position,
plot.subtitle,
plot.caption,
plot.caption.position,
plot.tag,
plot.tag.position,
plot.margin,
strip.background,
strip.background.x,
strip.background.y,
strip.placement,
strip.text,
strip.text.x,
strip.text.y,
strip.switch.pad.grid,
strip.switch.pad.wrap,
...,
complete = FALSE,
validate = TRUE
)
theme() constructs plots as a combination of:
The element_ functions specify how non-data components of the plot are drawn.
- element_blank(): draws nothing, and assigns no space
- element_rect(): borders and backgrounds
- element_line(): lines
- element_text(): text
element_rect()
# this code will not run
element_rect(
fill = NULL,
colour = NULL,
size = NULL,
linetype = NULL,
color = NULL,
inherit.blank = FALSE
)
# Colours
dark_green <- "#002D30"
light_grey <- "#F2F2F2"
green_blue <- "#007883"
light_green <- "#4FE18F"
blue <- "#00C3D6"
pink <- "#FF709D"
purple <- "#A361FF"
orange <- "#ffa45a"
# Font
# font <- "Derailed"
# The plot
ggplot(data = iris) +
geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
labs(title = "Differences in flower size by iris species",
subtitle = "Plot of petal width by petal length",
caption = "Data source: R") +
# Theme
theme(
# You can use this to remove all lines from the plot
#rect = element_blank(),
# Plot rectangle format:
plot.background = element_rect(fill = pink),
# Panel rectangle format:
panel.background = element_rect(fill = dark_green),
#panel.border = element_rect(fill = light_grey, colour = pink), # you probably will never need to use this
legend.key = element_rect(fill = "white"),
legend.background = element_rect(fill = orange),
legend.margin = margin(9,9,9,9),
legend.box.background = element_rect(fill = purple, colour = purple),
legend.box.margin = margin(9,9,9,9)
)
element_line()
# this code will not run
element_line(
colour = NULL,
size = NULL,
linetype = NULL,
lineend = NULL,
color = NULL,
arrow = NULL,
inherit.blank = FALSE
)
# Colours
dark_green <- "#002D30"
light_grey <- "#F2F2F2"
green_blue <- "#007883"
light_green <- "#4FE18F"
blue <- "#00C3D6"
pink <- "#FF709D"
purple <- "#A361FF"
orange <- "#ffa45a"
# Font
# font <- "Derailed"
# The plot
ggplot(data = iris) +
geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
labs(title = "Differences in flower size by iris species",
subtitle = "Plot of petal width by petal length",
caption = "Data source: R") +
# Theme
theme(
# You can use this to remove all lines from the plot
#line = element_blank(),
# Update the axis ticks
axis.ticks = element_line(colour = pink, lineend = "round", size = 2),
axis.ticks.length = unit(.5, "cm"),
# Update the axis lines
axis.line.x = element_line(colour = orange, lineend = "round", size = 2),
axis.line.y = element_line(colour = light_green, lineend = "round", size = 2),
# Update the grid lines
#panel.grid = element_line(colour = pink),
panel.grid.minor = element_line(colour = purple),
panel.grid.major = element_line(colour = blue)
)
element_text()
# this code will not run
element_text(
family = NULL,
face = NULL,
colour = NULL,
size = NULL,
hjust = NULL,
vjust = NULL,
angle = NULL,
lineheight = NULL,
color = NULL,
margin = NULL,
debug = NULL,
inherit.blank = FALSE
)
# Colours
dark_green <- "#002D30"
light_grey <- "#F2F2F2"
green_blue <- "#007883"
light_green <- "#4FE18F"
blue <- "#00C3D6"
pink <- "#FF709D"
purple <- "#A361FF"
orange <- "#ffa45a"
# Font
# font <- "Derailed"
# The plot
ggplot(data = iris) +
geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
labs(title = "Differences in flower size by iris species",
subtitle = "Plot of petal width by petal length",
caption = "Data source: R") +
# Theme
theme(
# Title text format:
# Set the font family and font colour
#text = element_text(family=font,
# color=purple),
# Make additional changes (e.g., size, type, margins) to the plot's title, subtitle and caption and the legend title and text here.
plot.title = element_text(size=18,
face="bold",
colour = purple
),
plot.subtitle = element_text(size=12,
margin=margin(9,0,9,0),
colour = blue
),
plot.caption = element_text(size=10,
colour = pink),
# Legend text format:
legend.title = element_text(size=12,
face = "bold",
colour = orange),
legend.text = element_text(size=12,
colour = dark_green),
# Axis text format:
axis.title.y = element_text(size=12,
colour = light_green),
axis.title.x = element_text(size=12,
colour = blue),
# Notice that the font family set in text is inherited down the text hierarchy, but the colour is not picked up by axis.text or strip.text because the default theme sets their colours explicitly
axis.text = element_text(size=10,
colour = green_blue),
# Strip text
strip.text = element_text(size=12,
hjust = 0,
colour = orange)
)
For this exercise we will work with the following example plot.
ggplot(data = iris) +
geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
facet_grid(cols = vars(Species)) +
labs(title = "Differences in flower size by iris species",
subtitle = "Plot of petal width by petal length",
caption = "Data source: R")
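Task: style the example plot by uncommenting and completing the theme() arguments in the template below. The theme() documentation (?theme) lists what each argument accepts.
ggplot(data = iris) +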
geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
facet_grid(cols = vars(Species)) +
labs(title = "Differences in flower size by iris species",
subtitle = "Plot of petal width by petal length",
caption = "Data source: R") +
theme(
# RECTANGLES
# element_rect()
#plot.background = element_rect(),
#panel.background = element_rect(),
#legend.key = element_rect(),
#legend.background = element_rect(),
#legend.margin = margin(),
#legend.position = "right",
#legend.direction = "vertical",
#legend.justification = "center",
#legend.box.background = element_rect(),
#legend.box.margin = margin(),
#panel.spacing = unit(1, "lines"),
# LINES
#axis.ticks = element_line(),
#axis.ticks.length = unit(.5, "cm"),
#axis.line = element_line(),
#axis.line.x = element_line(),
#axis.line.y = element_line(),
#panel.grid = element_line(),
#panel.grid.minor = element_line(),
#panel.grid.major = element_line(),
# TEXT
#text = element_text(),
#plot.title = element_text(),
#plot.subtitle = element_text(),
#plot.caption = element_text(),
#legend.title = element_text(),
#legend.title.align = 0,
#legend.text = element_text(),
#legend.text.align = 0,
#axis.title = element_text(),
#axis.title.y = element_text(),
#axis.title.x = element_text(),
#axis.text = element_text(),
#strip.text = element_text()
)
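Once you have settled on values in the template, one way to keep the style consistent across plots is to wrap the theme() call in a function and reuse it. The sketch below is only an assumption about how that might look: theme_house() is a hypothetical name, the colours are borrowed from the palette defined earlier, and the individual choices are placeholders for your own decisions.

```r
library(ggplot2)

# Hypothetical house theme: start from a complete theme, then override elements
theme_house <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", colour = "#002D30"),
      plot.subtitle = element_text(colour = "#007883"),
      panel.grid.minor = element_blank(),
      strip.text = element_text(hjust = 0, colour = "#002D30"),
      legend.position = "bottom"
    )
}

# Apply it to a single plot ...
p_house <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  facet_grid(cols = vars(Species)) +
  labs(title = "Differences in flower size by iris species",
       subtitle = "Plot of petal width by petal length",
       caption = "Data source: R") +
  theme_house()

p_house

# ... or make it the default for every plot in the session
# theme_set(theme_house())
```

Keeping the theme in one function means a later design change only has to be made in one place.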